feat(docker): production-optimized multi-stage Dockerfile #90
Closed
pratikbin wants to merge 254 commits into chopratejas:main
Conversation
Use asyncio.run() instead of asyncio.get_event_loop().run_until_complete() which raises RuntimeError in Python 3.10+ when no event loop exists.
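A minimal sketch of the change, stdlib only:

```python
import asyncio

async def main() -> str:
    return "done"

# Before: fails on Python 3.10+ when the thread has no running event loop
# result = asyncio.get_event_loop().run_until_complete(main())

# After: asyncio.run() creates, runs, and tears down a fresh event loop
result = asyncio.run(main())
```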
Move the network timeout skip handler to the main tests/conftest.py so it applies to all tests, not just tests/test_memory/*. Fixes flaky CI failures when HuggingFace model downloads timeout.
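A hedged sketch of what such a conftest.py handler can look like; the exact exception types the real hook matches are an assumption:

```python
# tests/conftest.py - turn network timeouts (e.g. HuggingFace downloads) into skips.
import pytest
import requests

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    outcome = yield
    if outcome.excinfo is not None:
        exc_type = outcome.excinfo[0]
        # Matched exception types are an assumption, not the project's exact list.
        if issubclass(exc_type, (requests.exceptions.Timeout, ConnectionError)):
            pytest.skip(f"network timeout during test: {exc_type.__name__}")
```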
- Add mkdocs.yml with Material theme (indigo, professional)
- Add docs/index.md landing page with quick install
- Add GitHub Actions workflow for auto-deployment
- Remove old docs/README.md (replaced by index.md)

- Add web dashboard at /dashboard endpoint with real-time stats
- Simplify dashboard metrics to user-friendly terms (removed confusing CCR/TOIN terminology)
- Track Headroom overhead separately from total latency
- Add request logging to Bedrock paths (was missing)
- Use package version (__version__) instead of hardcoded "1.0.0"
- Add latency min/max tracking in addition to average

Dashboard shows: requests, tokens saved, cost saved, overhead, providers breakdown, performance stats, and a recent requests table.

- Add dashboard URL (http://localhost:8787/dashboard) to quickstart
- Recommend headroom-ai[all] for best compression performance
- Note that first startup downloads ML models (~500MB one-time)
## Description

Add Headroom integration with AWS Strands Agents SDK, enabling automatic context optimization and tool output compression for Strands-based agents.

Fixes chopratejas#14

## Type of Change

- [x] New feature (non-breaking change that adds functionality)
- [x] Documentation update

## Changes Made

### Core Integration (`headroom/integrations/strands/`)

- **HeadroomHookProvider** - Implements Strands `HookProvider` interface for automatic tool output compression via `AfterToolCallEvent`. Compresses verbose tool outputs before they enter conversation context.
- **HeadroomStrandsModel** - Model wrapper that extends the Strands `Model` base class for message-level optimization. Implements all required abstract methods: `stream()`, `get_config()`, `update_config()`, `structured_output()`.
- **Provider auto-detection** - Automatically detects the appropriate Headroom provider (Anthropic, OpenAI, Google) based on the wrapped Strands model type.
- **`strands-agents` as optional dependency** - Install with `pip install headroom-ai[strands]`

### Testing (`tests/integrations/test_strands/`)

- **Real integration tests (25 tests)** - Use actual AWS Bedrock API calls with Claude 3 Haiku. Skip automatically when credentials are unavailable.
- **Unit tests (57 tests)** - Mock-based tests for internal logic, edge cases, and error handling. No credentials required.

### Demo (`examples/strands_bedrock_demo.py`)

- Interactive demo showcasing both integration patterns
- Visual before/after compression comparison with token savings
- 4 verbose tools (search, logs, database, metrics) demonstrating real savings
- Supports `--hook` and `--model` flags for individual demos

## Testing

All tests verified:

- [x] Unit tests pass (57 tests)
- [x] Integration tests pass (25 tests with real Bedrock API)
- [x] Linting passes (`ruff check .`)
- [x] Type checking passes (`mypy headroom/integrations/strands/`)
- [x] Formatting passes (`ruff format --check`)
- [x] Demo runs successfully with ~50% token savings

## Test Output

```
$ pytest tests/integrations/test_strands/ -v
=================== 82 passed in 90.09s ===================

$ ruff check headroom/integrations/strands/ --ignore E402
All checks passed!

$ mypy headroom/integrations/strands/ --ignore-missing-imports
Success: no issues found
```

## Demo Results

```
╭────────────────────────────────────────────────────────────╮
│                HeadroomHookProvider Results                │
│────────────────────────────────────────────────────────────│
│ Tokens BEFORE compression:  51,961                         │
│ Tokens AFTER compression:   25,658                         │
│ Tokens SAVED:               26,303 (50.6%)                 │
╰────────────────────────────────────────────────────────────╯
```
…tegration Feature/strands integration
DiffCompressor:
- Parse unified diff format and compress by reducing context lines
- Preserve file headers and all +/- change lines
- Score hunks by relevance (error keywords, query matches)
- Add summary line: [N files, +X -Y lines]
- Expected 30-50% savings on typical git diffs
- Wire into content router for CompressionStrategy.DIFF
- 30 tests covering parsing, compression, edge cases

hnswlib SIGILL fix:
- Move hnswlib import from module level to lazy loading
- hnswlib crashes with SIGILL (Illegal Instruction) on CPUs without AVX support, before Python can catch the error
- Now imports only when HNSWVectorIndex is actually used
- HNSW_AVAILABLE is checked lazily via __getattr__

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HTMLExtractor uses trafilatura to extract main content from HTML pages, removing scripts, styles, navigation, and ads. This achieves 94.9% compression while preserving 98.2% recall on the Scrapinghub benchmark.

Key features:
- Automatic HTML detection in content router
- Configurable output format (markdown or text)
- Metadata extraction (title, author, date, description)
- Batch extraction support

Evaluation framework:
- OSS benchmark integration (Scrapinghub Article Extraction Benchmark)
- LLM-as-judge evaluation for QA accuracy preservation
- F1 score: 0.919 on 181-sample benchmark (baseline: 0.958)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rename client/response variables to be unique per provider branch to avoid type inference conflicts. Use getattr for Anthropic content block text access to handle union types. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add native support for the OpenRouter API via the LiteLLM backend
- Introduce PROVIDER_REGISTRY pattern to eliminate scattered if/else blocks (a sketch follows below)
- New providers can now be added with a single registry entry

Features:
- `headroom proxy --backend openrouter` routes requests to OpenRouter
- Pass-through model naming (anthropic/claude-3.5-sonnet, openai/gpt-4o, etc.)
- CLI shows provider-specific setup instructions from the registry

Usage:
    export OPENROUTER_API_KEY="sk-or-v1-..."
    headroom proxy --backend openrouter

Also fixes mypy type errors in mcp_server.py
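A hedged sketch of the registry pattern; only the backend name and env var come from the commit message, and the entry fields are illustrative assumptions:

```python
# Hypothetical PROVIDER_REGISTRY entry shape; field names are illustrative.
PROVIDER_REGISTRY: dict[str, dict[str, str]] = {
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "api_key_env": "OPENROUTER_API_KEY",
        "setup_hint": 'export OPENROUTER_API_KEY="sk-or-v1-..."',
    },
    # Adding a provider is one entry here instead of another if/else branch.
}

def resolve_backend(name: str) -> dict[str, str]:
    try:
        return PROVIDER_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(PROVIDER_REGISTRY))
        raise ValueError(f"unknown backend {name!r}; known: {known}") from None
```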
…ic tools

Add ability to exclude specific tools from compression, useful for CLI tools like Claude Code where file/search output should be passed through unmodified.

Changes:
- Add DEFAULT_EXCLUDE_TOOLS constant with Read, Grep, Glob, Bash, WebFetch, WebSearch
- Add exclude_tools field to SmartCrusherConfig and ContentRouterConfig
- Add _build_tool_name_map() to ContentRouter for tool_call_id -> name mapping
- Skip compression for tool_result blocks from excluded tools
- Support both Anthropic (tool_use/tool_result) and OpenAI (tool_calls/tool) formats

This prevents Headroom from compressing output from tools where the user expects to see the full, unmodified content (e.g., file reads, search results). A usage sketch follows below.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
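A hedged usage sketch; the exclude_tools field and tool names come from this commit, while the import path is an assumption:

```python
# Import path is hypothetical; ContentRouterConfig.exclude_tools is named above.
from headroom.content_router import ContentRouterConfig

config = ContentRouterConfig(exclude_tools={"Read", "Grep", "Glob", "Bash"})
# tool_result blocks whose tool_call_id maps (via _build_tool_name_map) to one
# of these names bypass compression entirely.
```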
Add pytest.importorskip("trafilatura") to HTML extractor test modules
to skip tests gracefully when the optional trafilatura dependency is
not installed. This fixes CI failures in the base test matrix that
doesn't include the html extras.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
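The guard itself is a single line at the top of each affected test module:

```python
import pytest

# Skips the whole module (instead of erroring) when the html extra is absent.
trafilatura = pytest.importorskip("trafilatura")
```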
The previous fix attempted to lazily import hnswlib, but calling _check_hnswlib_available() still triggered the import, which crashed with SIGILL on CPUs without AVX support before Python could catch it.

Fix by using a subprocess to safely probe for hnswlib availability:
- Import AND create an Index in a subprocess to catch SIGILL at both import time and first use of AVX instructions
- If the subprocess succeeds, then import in the main process
- Add debug logging for all failure modes (timeout, crash, etc.)
- Isolates any crash to the subprocess, keeping the test process alive

A sketch of the probe follows below.

AI review: code-reviewer (1 iteration)
Adversarial review: code-critic (addressed logging, more robust probe)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
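A hedged sketch of the probe described above; hnswlib's Index/init_index calls are real API, while the helper name and the one-liner payload are assumptions:

```python
import subprocess
import sys

def _probe_hnswlib(timeout: float = 10.0) -> bool:
    """Import hnswlib AND build a tiny index in a child process, so a SIGILL
    on non-AVX CPUs kills the child instead of the test process."""
    code = (
        "import hnswlib; "
        "idx = hnswlib.Index(space='l2', dim=8); "
        "idx.init_index(max_elements=1)"
    )
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # treat a hung probe as unavailable
    # A negative returncode means the child died on a signal (e.g. SIGILL).
    return proc.returncode == 0
```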
Add pytestmark skip conditions to memory test modules that depend on hnswlib (core_operations, factory, easy). The subprocess probe for hnswlib correctly detects unavailability on some platforms (like Python 3.13 CI runners), but these tests were still trying to run and failing with ImportError. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## What this PR fixes

1. **CI Python 3.12 failure**: Added skip decorator to `TestLocalBackend` in `test_memory_system.py` - these tests require hnswlib, which is not available on all CI runners.
2. **Missing test coverage**: Added 6 tests for the `exclude_tools` feature in `test_content_router.py`. Tests use existing helper functions `generate_python_code()`, `generate_json_data()`, and `generate_search_results()` defined at lines 57-95 of the same file.
3. **Anthropic/OpenAI inconsistency**: Fixed `_process_content_blocks()` to add the `router:excluded:tool` marker for the Anthropic format, matching the OpenAI format behavior at line 1157.
4. **Dead code removal**: Removed the unused `exclude_tools` field from `SmartCrusherConfig` - the actual implementation uses `ContentRouterConfig.exclude_tools` in content_router.py.

AI review: code-reviewer (2 iterations), adversarial-reviewer (2 iterations)
Issues fixed: missing test coverage, format inconsistency, dead code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WebFetch and WebSearch should NOT be excluded by default because:

1. Web content is Headroom's sweet spot - lots of noise (nav, ads, boilerplate)
2. CCR allows retrieval if the LLM needs the original content
3. Excluding them undermines the core value proposition

DEFAULT_EXCLUDE_TOOLS now only contains local file/code tools: Read, Glob, Grep, Bash (and lowercase variants).

These local tools return precise content (line numbers, paths, code) where exact fidelity matters immediately. Web tools benefit from compression and can use CCR for on-demand retrieval.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e-tools-compression feat(compression): add exclude_tools to bypass compression for specific tools
- Add `question` parameter to LLMLinguaCompressor.compress() for QA-aware token selection (passes through to LLMLingua-2's compress_prompt)
- Flow the `question` parameter through the ContentRouter compression pipeline
- Enable ContentRouter in the default pipeline (was missing, causing 0% compression)
- Add `content_router_enabled` config option to HeadroomConfig

This improves compression accuracy for QA tasks by allowing LLMLingua-2 to preserve tokens relevant to answering the given question. A usage sketch follows below.
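A hedged usage sketch of the new parameter; the class name comes from the commit, while the import path and surrounding values are illustrative:

```python
from headroom.compressors import LLMLinguaCompressor  # import path is an assumption

compressor = LLMLinguaCompressor()
long_tool_output = "...thousands of tokens of logs..."
compressed = compressor.compress(
    long_tool_output,
    question="Which service caused the 5xx spike?",  # steers LLMLingua-2 selection
)
```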
Root Cause:
The `find_tool_units()` function in `parser.py` only detected OpenAI-format tool calls (assistant.tool_calls + role="tool" messages), not the Anthropic format (assistant.content[type=tool_use] + user.content[type=tool_result]).

This caused the RollingWindow and IntelligentContext transforms to treat Anthropic tool_use and tool_result as separate, independently droppable messages. When context needed to be trimmed, the assistant message with tool_use could be dropped while keeping the user message with tool_result, creating orphaned tool_result blocks. When sent to the Anthropic API, this produces the error: "unexpected tool_use_id found in tool_result blocks"

Changes:
1. parser.py: Extended `find_tool_units()` to detect the Anthropic format (a sketch follows below):
   - Scan user messages for content blocks with type="tool_result"
   - Scan assistant messages for content blocks with type="tool_use"
   - Map tool_use_id to corresponding response message indices
2. rolling_window.py: Extended `_get_protected_indices()` to protect Anthropic-format tool pairs:
   - Detect tool_use blocks in assistant.content
   - Find and protect matching user messages with tool_result blocks
3. tests/test_parser.py: Added 4 new tests for the Anthropic format:
   - test_anthropic_format_tool_use_and_result
   - test_anthropic_format_multiple_tool_uses
   - test_anthropic_format_orphaned_tool_result
   - test_mixed_openai_and_anthropic_formats

Test Results: 82 passed (including 4 new Anthropic format tests)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
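A hedged, simplified sketch of the Anthropic-format scan added in change 1; the message and block shapes match Anthropic's API, while the bookkeeping is reduced:

```python
def anthropic_tool_units(messages: list[dict]) -> dict[str, list[int]]:
    """Map each tool_use_id to its [assistant_idx, user_idx] pair, which must
    be dropped or kept atomically. Simplified relative to find_tool_units()."""
    units: dict[str, list[int]] = {}
    for i, msg in enumerate(messages):
        blocks = msg.get("content")
        if not isinstance(blocks, list):
            continue  # plain string content carries no tool blocks
        for block in blocks:
            if msg.get("role") == "assistant" and block.get("type") == "tool_use":
                units[block["id"]] = [i, -1]  # result index filled in below
            elif msg.get("role") == "user" and block.get("type") == "tool_result":
                unit = units.get(block.get("tool_use_id", ""))
                if unit is not None:
                    unit[1] = i  # pair the result with its originating call
    return units
```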
…ropic-tool-unit-detection fix: Handle Anthropic format tool_use/tool_result as atomic units
…ager

Follows up on PR chopratejas#19, which fixed RollingWindow but missed IntelligentContextManager, the default context manager used by the proxy.

Changes:
1. intelligent_context.py: Extended `_get_protected_indices()` to handle the Anthropic format:
   - Scan assistant.content for type="tool_use" blocks
   - Protect user messages containing type="tool_result" blocks with matching tool_use_id
2. test_intelligent_context.py: Added TestAnthropicFormatToolProtection class with 5 tests:
   - test_anthropic_tool_result_protected_when_tool_use_protected
   - test_anthropic_tool_units_dropped_atomically
   - test_anthropic_multiple_tools_same_message_atomic
   - test_anthropic_format_no_api_error_scenario (verifies bug fix)
   - test_mixed_openai_and_anthropic_formats
3. test_rolling_window.py: Added matching TestAnthropicFormatToolProtection class with 5 tests
   - Added skip decorator for CI/CD when OPENAI_API_KEY is not set

This ensures both context managers (RollingWindow and IntelligentContextManager) correctly handle Anthropic's native tool_use/tool_result format, preventing the "unexpected tool_use_id found in tool_result blocks" API error.
…ntext-anthropic-format fix: Extend Anthropic format tool protection to IntelligentContextManager
- Add MemoryToolAdapter for unified memory across providers
- Anthropic: uses the native memory tool (memory_20250818) for subscription safety
- OpenAI/Gemini/others: uses function-calling format
- All providers share the same semantic vector store backend
- Simplify CLI to a single --memory flag with auto-detection
- Add proper resource cleanup (close methods) to fix test isolation
- Update README with memory documentation
Implements comprehensive memory tracking for all in-memory components:
- Add MemoryTracker singleton with ComponentStats, ProcessStats, MemoryReport
- Add get_memory_stats() to CompressionStore, BatchContextStore, GraphStore, HNSWVectorIndex
- Add /debug/memory API endpoint for runtime monitoring

Components tracked:
- compression_store: CCR compressed tool outputs
- batch_context_store: Batch API request contexts
- graph_store: knowledge graph entities and relationships
- vector_index: HNSW vector embeddings
- semantic_cache: response cache
- request_logger: request metadata

Includes 47 tests (unit + integration) with real API calls.
…ervability Add memory observability system (Phase 1)
Replace the unbounded InMemoryGraphStore with a SQLite-backed implementation:
- Persistent storage survives proxy restarts
- Memory bounded by a configurable SQLite page cache (default 8MB)
- Same async interface as InMemoryGraphStore (drop-in replacement)
- LocalBackend now uses SQLiteGraphStore by default (graph_persist=True)

New files:
- headroom/memory/adapters/sqlite_graph.py: SQLite graph store implementation
- tests/test_sqlite_graph_store.py: 37 comprehensive tests

Key features:
- O(log n) lookups via database indexes
- Case-insensitive entity name lookup per user
- BFS subgraph traversal and shortest-path finding
- CASCADE delete for entity relationships
- MemoryTracker integration via get_memory_stats()
…ervability Add SQLiteGraphStore for bounded, persistent graph storage
…d tool

Bedrock requires role=tool messages immediately after assistant tool_calls. The previous fix inserted a user text message in between when the message contained both text and tool_result blocks, breaking the pairing. Drop the text alongside tool_result (Claude Code never sends it in practice). Added ordering regression tests for the Bedrock constraint.
The /v1/responses handler was passing through without compression, meaning Codex CLI users got zero savings. Now converts Responses API items (function_call, function_call_output, reasoning, message) to Chat Completions format, runs the existing pipeline, and converts back.

- New: headroom/proxy/responses_converter.py - pure conversion functions
- 21 unit tests + 3 integration tests (tested with the real OpenAI API)
- Preserves reasoning items, images, unknown types verbatim
- Skips compression when previous_response_id is set
- 27% compression on real Codex-pattern payloads (500 records → 14K tokens saved)

Closes chopratejas#73
Private scripts with credentials should not be tracked in git.
Codex v0.117.0+ with newer models uses WebSocket instead of HTTP POST
for the Responses API. Added @app.websocket("/v1/responses") handler that:
- Accepts ws:// connections and forwards to wss://api.openai.com
- Compresses input on first message using existing pipeline
- Relays all response events bidirectionally
- Handles SSL (certifi), graceful disconnect, missing websockets lib
Tested with real OpenAI API: basic text, large tool outputs (200 records),
parallel function calls, instructions preservation.
Addresses chopratejas#79
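A hedged sketch of the relay shape; the endpoint path and upstream host follow this commit, while header forwarding, compression, and error handling are elided (later commits refine those):

```python
import asyncio

import websockets
from fastapi import FastAPI, WebSocket

app = FastAPI()
UPSTREAM = "wss://api.openai.com/v1/responses"  # assumed upstream URL

@app.websocket("/v1/responses")
async def responses_ws(client: WebSocket) -> None:
    await client.accept()
    async with websockets.connect(UPSTREAM) as upstream:

        async def client_to_upstream() -> None:
            while True:
                text = await client.receive_text()
                # The real handler runs the first message's input through
                # the existing compression pipeline before forwarding.
                await upstream.send(text)

        async def upstream_to_client() -> None:
            async for event in upstream:
                await client.send_text(
                    event if isinstance(event, str) else event.decode()
                )

        await asyncio.gather(client_to_upstream(), upstream_to_client())
```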
KompressCompressor now tries ONNX Runtime first (156MB INT8 model) and falls back to PyTorch only if ONNX is unavailable. No torch needed for text compression - just onnxruntime (~50MB) + transformers (tokenizer).

Changes:
- Add onnxruntime + transformers to the [proxy] extra in pyproject.toml
- Add _OnnxModel wrapper with get_scores/get_keep_mask interface
- _load_kompress() tries ONNX first, falls back to PyTorch (a sketch follows below)
- is_kompress_available() returns True if EITHER backend is available
- compress() handles both numpy (ONNX) and tensor (PyTorch) outputs

Dependency impact:
- Before: pip install headroom-ai[proxy] → no text compression
- After: pip install headroom-ai[proxy] → Kompress ONNX INT8 (156MB)
- [ml] extra still available for full PyTorch (600MB, GPU support)
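A hedged sketch of the fallback order; _load_kompress and _OnnxModel are named in the commit message, while _TorchModel is a hypothetical stand-in for the PyTorch wrapper:

```python
class _OnnxModel:  # stub standing in for the real get_scores/get_keep_mask wrapper
    ...

class _TorchModel:  # hypothetical name for the PyTorch-backed wrapper
    ...

def _load_kompress():
    """Prefer the ~50MB onnxruntime backend; fall back to full PyTorch."""
    try:
        import onnxruntime  # noqa: F401  # INT8 model, no torch required
        return _OnnxModel()
    except ImportError:
        pass
    try:
        import torch  # noqa: F401  # ~600MB extra, GPU-capable
        return _TorchModel()
    except ImportError:
        return None  # neither backend: is_kompress_available() reports False
```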
- WebSocket proxy for /v1/responses (Codex gpt-5.4+ support)
- Kompress ONNX INT8 text compression (no torch needed, ~100MB vs 1.5GB)
- Updated Discord invite link
- tool_result ordering fix for Bedrock
…thropic-api-url flag
OpenAI's WebSocket Responses API requires the header 'OpenAI-Beta: responses-api=v1'. Without it, the server returns HTTP 500 on the WebSocket upgrade. Also forward all client headers (not just auth) to upstream, skipping only hop-by-hop headers. Tested with real OpenAI API: basic text, large tool output compression, all working through WebSocket proxy. Fixes chopratejas#82, updates chopratejas#79
…history feat: persist proxy savings history
savings_usd is now tokens_saved * the model's list input price (monotonic, transparent; a sketch follows below). Removed the non-monotonic moving-average repricing and the confusing cost_without_headroom counterfactual.

The dashboard hero shows "Compression Savings" with a clear subtitle. The Savings Breakdown section shows compression, cache, and RTK separately, with distinct colors and no scope mixing. All beacon/telemetry fields are preserved. RTK token counts are still reported.

Fixes chopratejas#83
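A minimal sketch of the new pricing rule, with a hypothetical list price:

```python
def savings_usd(tokens_saved: int, list_input_price_per_mtok: float) -> float:
    """Monotonic by construction: tokens_saved only grows and the price is fixed."""
    return tokens_saved / 1_000_000 * list_input_price_per_mtok

# e.g. 26,303 tokens saved at a hypothetical $3.00 per million input tokens:
print(f"${savings_usd(26_303, 3.00):.4f}")  # -> $0.0789
```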
…0.5.17

- Fix beacon spam: a file lock ensures only one beacon per proxy regardless of worker count (a sketch follows below). Workers > 1 caused N beacons firing N rows per cycle.
- Beacon upsert: on_conflict=session_id prevents duplicate rows.
- Beacon stop() guard: skip the final report if uptime < 2 minutes.
- Fix dashboard cost: savings_usd now uses the model list price (monotonic), not a moving average. Separate breakdown for compression/cache/rtk.

Fixes chopratejas#83
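A hedged sketch of the worker lock; the lock-file path is an assumption, and fcntl makes this POSIX-only:

```python
import fcntl

def try_acquire_beacon_lock(path: str = "/tmp/headroom-beacon.lock"):
    """Return an open handle if this worker should own the beacon, else None."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking: lose fast
        return handle  # keep the handle open for the proxy's lifetime to hold the lock
    except BlockingIOError:
        handle.close()
        return None  # another worker already owns the beacon
```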
feat: Add Anthropic URL overrides via env vars and CLI flags
transforms_summary is a counted dict (e.g. {"router:tool_result:text": 4})
alongside the raw transforms_applied list. Cleaner display for users
without losing the raw data for debugging.
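The counted dict is a plain reduction of the raw list; the second transform name below is made up for illustration:

```python
from collections import Counter

transforms_applied = ["router:tool_result:text"] * 4 + ["example:transform"]
transforms_summary = dict(Counter(transforms_applied))
# {"router:tool_result:text": 4, "example:transform": 1}
```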
Move ProxyConfig, RequestLog, CacheEntry, RateLimitState to headroom/proxy/models.py. Re-exported from server.py for backward compatibility - all existing imports continue to work (a sketch follows below).

server.py: 8835 → 8643 lines (-192)
models.py: 199 lines (new)

Part of the server.py split effort to improve maintainability.
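A minimal sketch of the backward-compatibility shim left behind in server.py:

```python
# headroom/proxy/server.py
# The classes now live in models.py; re-export them so existing
# `from headroom.proxy.server import ProxyConfig` imports keep working.
from headroom.proxy.models import (  # noqa: F401
    CacheEntry,
    ProxyConfig,
    RateLimitState,
    RequestLog,
)
```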
…nd forwarding

- Fix WS /v1/responses: forward Sec-WebSocket-Protocol (subprotocol) to upstream instead of stripping it - the root cause of Codex HTTP 500 errors
- Fix WS relay: handle binary messages properly instead of crashing on .decode(); add debug logging instead of a silent except: pass
- Add Authorization header fallback from the OPENAI_API_KEY env var for WS
- Extract the response body from websockets InvalidStatus for error debugging
- Fix streaming /v1/responses: pass optimized_tokens (not original_tokens twice) so compression savings appear in streaming metrics
- Fix hardcoded provider="bedrock" in 4 metrics/log locations - now uses self.anthropic_backend.name so LiteLLM backends report correctly
- Forward --backend, --anyllm-provider, --region flags from wrap commands (codex, aider) to the proxy subprocess via _start_proxy()
- Forward the API key from request headers to LiteLLM acompletion() calls
- Forward region to Vertex AI (vertex_location), not just Bedrock
- Redesign the proxy startup banner: show a routing table instead of the misleading "Backend: Anthropic" label

Closes chopratejas#86

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CacheAligner was hardcoded to enabled=True in the pipeline despite the config default being False. It extracts dynamic content from the middle of the system prompt and reinserts it at the end, which:

1. CHANGES the prefix bytes → provider cache miss (loses the 90% discount)
2. ADDS ~341 tokens of formatting overhead per request
3. Net effect: more expensive, worse caching

Now uses the config default (enabled=False). The CacheAligner still exists for users who explicitly opt in, but the proxy no longer forces it on.
Manual ssl.create_default_context() + certifi doesn't load the Windows system certificate store, causing HTTP 500 on wss:// connections to OpenAI. Using ssl=True lets the websockets library handle SSL natively with proper cross-platform cert store loading.
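The fix is a one-argument change at the connect call:

```python
import websockets

async def connect_upstream(url: str):
    # ssl=True lets the websockets library build the SSL context itself,
    # which loads the platform cert store (including Windows) rather than
    # a manually constructed certifi-backed context.
    return await websockets.connect(url, ssl=True)
```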
Multi-stage build with uv, non-root user, and layer-optimized caching (a sketch follows below).

- Multi-stage: build deps (gcc/g++) stay in the builder, runtime is clean slim
- uv instead of pip: uses the existing uv.lock for deterministic, fast installs
- Layer caching: deps cached independently from source (rebuild 37s -> 4s)
- Non-root: runs as headroom:1000 instead of root
- Image size: 1.11GB -> 514MB (-54%)
- Add CI workflow for multi-arch (amd64+arm64) GHCR publishing
- Expand .dockerignore to exclude JS artifacts, IDE files, Docker files

Closes chopratejas#89
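A hedged sketch of the multi-stage layout described above; the base image, uv lockfile install, and headroom:1000 user come from the PR, while the package lists, paths, and entrypoint command are assumptions:

```dockerfile
# --- builder: compilers and uv stay here, never shipped ---
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc g++ \
    && rm -rf /var/lib/apt/lists/*
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
# Lockfiles first: the dependency layer caches across source-only edits.
COPY pyproject.toml uv.lock ./
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --no-install-project
COPY . .
RUN --mount=type=cache,target=/root/.cache/uv uv sync --frozen --no-dev

# --- runtime: slim image with curl only, non-root ---
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --uid 1000 --create-home headroom
COPY --from=builder /app /app
USER headroom
WORKDIR /app
ENV PATH="/app/.venv/bin:$PATH"
ENTRYPOINT ["headroom"]
CMD ["proxy"]
```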
Owner
Thanks for the changes. Regarding the UI: we already have a bare-bones dashboard in Headroom. Do you think we should augment that?

## Summary

- `python:3.11-slim` base with only `curl` in the runtime image
- `uv` with lockfile instead of raw `pip` for deterministic, fast installs with build cache mounts
- `headroom:1000` user instead of root
- `ENTRYPOINT`/`CMD` separation for clean docker-compose overrides
- Multi-arch (`linux/amd64` + `linux/arm64`) image publishing to GHCR
- `.dockerignore` expanded to exclude JS artifacts, IDE files, Docker meta-files

## Benchmarks (cold build, Apple Silicon / OrbStack)

| Metric | Before | After |
| --- | --- | --- |
| Image size | 1.11 GB | 514 MB (-54%) |
| Rebuild with cached deps | 37 s | 4 s |
Closes #89
## Test plan

- Local build (`docker build -t headroom:local .`)
- Cross-arch build (`docker buildx build --platform linux/arm64`)
- `/health` returns 200
- `--help` works inside container